
Prediction

Predictions

class eole.predict.prediction.Prediction(src, srclen, pred_sents, attn, pred_scores, estim, tgt_sent, gold_score, word_aligns, ind_in_bucket)

Bases: object

Container for a predicted sentence.

  • Variables:
    • src (LongTensor) – Source word IDs.
    • srclen (List[int]) – Source lengths.
    • pred_sents (List[List[str]]) – Words from the n-best predictions.
    • pred_scores (List[List[float]]) – Log-probs of the n-best predictions.
    • attns (List[FloatTensor]) – Attention distribution for each prediction.
    • gold_sent (List[str]) – Words from the gold prediction.
    • gold_score (List[float]) – Log-prob of the gold prediction.
    • word_aligns (List[FloatTensor]) – Word alignment distribution for each prediction.

log(sent_number, src_raw='')

Log prediction.

class eole.predict.prediction.PredictionBuilder(vocabs, n_best=1, replace_unk=False, phrase_table='', tgt_eos_idx=None, id_tokenization=False)

Bases: object

Build a word-based prediction from the batch output of the predictor and the underlying dictionaries.

Replacement based on “Addressing the Rare Word Problem in Neural Machine Translation” (Luong et al., 2015).

  • Parameters:
    • vocabs (dict) – The underlying dictionaries.
    • n_best (int) – Number of predictions produced.
    • replace_unk (bool) – Replace unknown words using attention.
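
The replace_unk behavior can be sketched as follows. The helper name and list-based layout here are illustrative only, not the actual PredictionBuilder internals: each <unk> in the output is replaced by the source word that received the most attention at that decoding step.

```python
def replace_unk_tokens(pred_tokens, src_tokens, attn, unk_token="<unk>"):
    """Replace each <unk> in the prediction with the source token that
    received the highest attention weight at that decoding step.

    pred_tokens: list[str], predicted words (may contain <unk>)
    src_tokens:  list[str], raw source words
    attn:        list[list[float]], one attention row per predicted word
    """
    out = []
    for i, tok in enumerate(pred_tokens):
        if tok == unk_token:
            # source position the decoder attended to most at step i
            row = attn[i][: len(src_tokens)]
            max_idx = max(range(len(row)), key=lambda j: row[j])
            out.append(src_tokens[max_idx])
        else:
            out.append(tok)
    return out
```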

Predictor Classes

class eole.predict.inference.Inference(model, vocabs, config, model_config, device_id=0, global_scorer=None, report_score=True, logger=None, return_gold_log_probs=False)

Bases: object

Predict a batch of sentences with a saved model.

  • Parameters:
    • model (eole.modules.BaseModel) – Model to use for prediction
    • vocabs (dict[str, Vocab]) – A dict mapping each side to its Vocab.
    • config
    • model_config
    • device_id
    • global_scorer (eole.predict.GNMTGlobalScorer) – Prediction scoring/reranking object.
    • report_score (bool) – Whether to report scores
    • logger (logging.Logger or NoneType) – Logger.

predict_batch(batch, attn_debug, streamer=None)

Predict a batch of sentences.

class eole.predict.Translator(model, vocabs, config, model_config, device_id=0, global_scorer=None, report_score=True, logger=None, return_gold_log_probs=False)

Bases: Inference

predict_batch(batch, attn_debug, streamer=None)

Translate a batch of sentences.

class eole.predict.GeneratorLM(model, vocabs, config, model_config, device_id=0, global_scorer=None, report_score=True, logger=None, return_gold_log_probs=False)

Bases: Inference

predict_batch(batch, attn_debug, scoring=False, streamer=None)

Predict a batch of sentences.

  • Parameters:
    • batch – Batch of source data.
    • attn_debug (bool) – Whether to return attention weights.
    • scoring (bool) – Whether to run in scoring mode.
    • streamer (GenerationStreamer , optional) – If provided, tokens are pushed to the streamer at each decoding step to enable token-by-token output streaming.

class eole.predict.Encoder(model, vocabs, config, model_config, device_id=0, global_scorer=None, report_score=True, logger=None, return_gold_log_probs=False)

Bases: Inference

predict_batch(batch, attn_debug, streamer=None)

Predict a batch of sentences.

class eole.predict.AudioPredictor(model, vocabs, config, model_config, device_id=0, global_scorer=None, report_score=True, logger=None, return_gold_log_probs=False)

Bases: Translator

Translator subclass for audio encoder-decoder models.

Adds:

  • Token suppression (suppress_tokens from eole config)
  • Forced decoder prefix (SOT, language, task tokens)
  • Sequential timestamp-seeking: decodes audio windows using timestamp tokens to determine seek advancement
  • Configurable timestamp output: none (plain text), segment (JSON), word

predict_batch(batch, attn_debug, streamer=None)

Override to inject decoder prefix tensor into batch.

Streaming

class eole.predict.streamer.GenerationStreamer(vocabs, transform_pipe=None, timeout: float = 120.0)

Bases: object

Streamer for token-by-token generation output.

Tokens are put into a thread-safe queue by the generation loop and can be consumed as a Python iterator. The streamer handles incremental detokenization so that consumers receive human-readable text chunks.

This is primarily designed for use with GeneratorLM (decoder-only LLM models). For best results, use with batch_size=1.

  • Parameters:
    • vocabs (dict) – Vocabulary dictionaries from the model.
    • transform_pipe (TransformPipe, optional) – Transform pipeline for detokenization. When provided (typical for HuggingFace / id-tokenization models), full-sequence incremental decoding is used to yield clean text. When None, tokens are looked up directly in the vocabulary.
    • timeout (float) – Maximum seconds to wait for the next token before the iterator stops. Default is 120.0.

Example usage:

import threading
from eole.inference_engine import InferenceEnginePY
from eole.predict.streamer import GenerationStreamer

engine = InferenceEnginePY(config)
streamer = GenerationStreamer(engine.predictor.vocabs,
                              engine.transform_pipe)

def run():
    engine.infer_list(["Hello, how are you?"], streamer=streamer)

thread = threading.Thread(target=run, daemon=True)
thread.start()

for chunk in streamer:
    print(chunk, end="", flush=True)

thread.join()

end()

Signal that generation is complete.

Must be called once by the inference thread after the last token has been put, so that the consumer iterator can terminate cleanly.

put(token_ids)

Add newly generated token IDs to the stream.

Called by the generation loop after each decoding step.

  • Parameters: token_ids – A 1-D tensor or list of shape (batch_size,) containing the token IDs produced at the current step. Only the first element is used for streaming.
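
The producer side of this contract (one put() call per decoding step, then a single end()) can be illustrated with a minimal stand-in. This sketch skips detokenization entirely and is not the actual GenerationStreamer implementation; it only shows how the queue-backed put/end/iterator protocol fits together.

```python
import queue

class MiniStreamer:
    """Minimal stand-in for GenerationStreamer illustrating the
    put()/end()/iterator contract (no detokenization)."""

    _END = object()  # sentinel signalling that generation is complete

    def __init__(self, timeout=120.0):
        self.q = queue.Queue()
        self.timeout = timeout

    def put(self, token_ids):
        # only the first batch element is streamed
        self.q.put(token_ids[0])

    def end(self):
        self.q.put(self._END)

    def __iter__(self):
        while True:
            item = self.q.get(timeout=self.timeout)
            if item is self._END:
                return
            yield item
```

In real use, put() is called from the inference thread while the consumer iterates on the main thread, exactly as in the threading example above.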

Decoding Strategies

class eole.predict.decode_strategy.DecodeStrategy(pad, bos, eos, unk, start, batch_size, parallel_paths, global_scorer, min_length, block_ngram_repeat, exclusion_tokens, return_attention, max_length, ban_unk_token, add_estimator)

Bases: object

Base class for generation strategies.

  • Parameters:
    • pad (int) – Magic integer in output vocab.
    • bos (int) – Magic integer in output vocab.
    • eos (int) – Magic integer in output vocab.
    • unk (int) – Magic integer in output vocab.
    • start (int) – Magic integer in output vocab.
    • batch_size (int) – Current batch size.
    • parallel_paths (int) – Decoding strategies like beam search use parallel paths. Each batch is repeated parallel_paths times in relevant state tensors.
    • min_length (int) – Shortest acceptable generation, not counting begin-of-sentence or end-of-sentence.
    • max_length (int) – Longest acceptable sequence, not counting begin-of-sentence (presumably there has been no EOS yet if max_length is used as a cutoff).
    • ban_unk_token (Boolean) – Whether unk token is forbidden
    • block_ngram_repeat (int) – Block beams where block_ngram_repeat-grams repeat.
    • exclusion_tokens (set[int]) – If a gram contains any of these tokens, it may repeat.
    • return_attention (bool) – Whether to work with attention too. If this is true, it is assumed that the decoder is attentional.
  • Variables:
    • pad (int) – See above.
    • bos (int) – See above.
    • eos (int) – See above.
    • unk (int) – See above.
    • start (int) – See above.
    • predictions (list[list[LongTensor]]) – For each batch, holds a list of beam prediction sequences.
    • scores (list[list[FloatTensor]]) – For each batch, holds a list of scores.
    • attention (list[list[FloatTensor or list[]]]) – For each batch, holds a list of attention sequence tensors (or empty lists) having shape (step, inp_seq_len), where inp_seq_len is the length of the sample (not the max length of all input sequences).
    • alive_seq (LongTensor) – Shape (B x parallel_paths, step). This sequence grows in the step axis on each call to :func:advance().
    • is_finished (ByteTensor or NoneType) – Shape (B, parallel_paths). Initialized to None.
    • alive_attn (FloatTensor or NoneType) – If tensor, shape is (B x parallel_paths, step, inp_seq_len), where inp_seq_len is the (max) length of the input sequence.
    • target_prefix (LongTensor or NoneType) – If tensor, shape is (B x parallel_paths, prefix_seq_len), where prefix_seq_len is the (max) length of the pre-fixed prediction.
    • min_length (int) – See above.
    • max_length (int) – See above.
    • ban_unk_token (Boolean) – See above.
    • block_ngram_repeat (int) – See above.
    • exclusion_tokens (set[int]) – See above.
    • return_attention (bool) – See above.
    • done (bool) – See above.

advance(log_probs, attn)

DecodeStrategy subclasses should override advance().

Advance is used to update self.alive_seq, self.is_finished, and, when appropriate, self.alive_attn.

block_ngram_repeats(log_probs)

We prevent the beam from going in any direction that would repeat any ngram of size <block_ngram_repeat> more than once.

The way we do it: we maintain a list of all ngrams of size <block_ngram_repeat> that is updated each time the beam advances, and manually set the probability of any token that would lead to a repeated ngram to 0.

This improves on the previous version’s complexity:

  • previous version’s complexity: batch_size * beam_size * len(self)
  • current version’s complexity: batch_size * beam_size

This improves on the previous version’s accuracy:

  • Previous version blocks the whole beam, whereas here we only block specific tokens.
  • Previously, prediction would fail when all beams contained repeated ngrams; that can no longer happen here.
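
The blocking step can be sketched in pure Python. This is illustrative only: the real method operates on batched log-prob tensors for all beams at once, while the sketch below handles a single beam, with a dict standing in for one vocab-sized score row.

```python
def block_repeated_ngrams(seq, forbidden, log_probs, n):
    """Mask the score of any token that would complete an n-gram
    already present in `forbidden`.

    seq:       list[int], tokens generated so far for one beam
    forbidden: set[tuple[int, ...]], n-grams seen so far in this beam
    log_probs: dict[int, float], stand-in for a vocab-sized score row
    n:         size of the blocked n-grams
    """
    if len(seq) < n - 1:
        return log_probs  # not enough context to form an n-gram yet
    prefix = tuple(seq[-(n - 1):])  # last n-1 generated tokens
    for tok in list(log_probs):
        if prefix + (tok,) in forbidden:
            log_probs[tok] = float("-inf")  # probability 0
    return log_probs
```

Only tokens completing a forbidden n-gram are masked, so the beam itself survives, which is the accuracy improvement described above.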

initialize(device=None, target_prefix=None)

DecodeStrategy subclasses should override initialize().

initialize should be called before all actions; it prepares the necessary ingredients for decoding.

maybe_update_forbidden_tokens()

We complete and reorder the list of forbidden_tokens.

maybe_update_target_prefix(select_index)

We update / reorder target_prefix for the alive paths.

target_prefixing(log_probs)

Fix the first part of predictions with self.target_prefix.

Args: log_probs (FloatTensor): logits of size (B, vocab_size).

Returns: log_probs (FloatTensor): modified logits in (B, vocab_size).
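
For illustration, forcing a single prefix token at one decoding step can be sketched like this. The function name and the list standing in for one (vocab_size,) row of log_probs are hypothetical; the actual method handles batches and prefixes of differing lengths.

```python
import math

def apply_target_prefix(log_probs, prefix_token):
    """Force the decoder to emit `prefix_token` at this step by masking
    every other entry of the score row to -inf."""
    return [lp if tok == prefix_token else -math.inf
            for tok, lp in enumerate(log_probs)]
```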

update_finished()

DecodeStrategy subclasses should override update_finished().

update_finished is used to update self.predictions, self.scores, and other “output” attributes.

class eole.predict.beam_search.BeamSearchBase(beam_size, batch_size, pad, bos, eos, unk, start, n_best, global_scorer, min_length, max_length, return_attention, block_ngram_repeat, exclusion_tokens, stepwise_penalty, ratio, ban_unk_token, add_estimator=False)

Bases: DecodeStrategy

Generation beam search.

Note that the attributes list is not exhaustive. Rather, it highlights tensors to document their shape. (Since the state variables’ “batch” size decreases as beams finish, we denote this axis with a B rather than batch_size).

  • Parameters:
    • beam_size (int) – Number of beams to use (see base parallel_paths).
    • batch_size (int) – See base.
    • pad (int) – See base.
    • bos (int) – See base.
    • eos (int) – See base.
    • unk (int) – See base.
    • start (int) – See base.
    • n_best (int) – Don’t stop until at least this many beams have reached EOS.
    • global_scorer (eole.predict.GNMTGlobalScorer) – Scorer instance.
    • min_length (int) – See base.
    • max_length (int) – See base.
    • return_attention (bool) – See base.
    • block_ngram_repeat (int) – See base.
    • exclusion_tokens (set[int]) – See base.
  • Variables:
    • _batch_offset (LongTensor) – Shape (B,).
    • _beam_offset (LongTensor) – Shape (batch_size x beam_size,).
    • alive_seq (LongTensor) – See base.
    • topk_log_probs (FloatTensor) – Shape (B, beam_size). These are the scores used for the topk operation.
    • src_len (LongTensor) – Lengths of encodings. Used for masking attentions.
    • select_indices (LongTensor or NoneType) – Shape (B x beam_size,). This is just a flat view of the _batch_index.
    • topk_scores (FloatTensor) – Shape (B, beam_size). These are the scores a sequence will receive if it finishes.
    • topk_ids (LongTensor) – Shape (B, beam_size). These are the word indices of the topk predictions.
    • _batch_index (LongTensor) – Shape (B, beam_size).
    • _prev_penalty (FloatTensor or NoneType) – Shape (B, beam_size). Initialized to None.
    • _coverage (FloatTensor or NoneType) – Shape (1, B x beam_size, inp_seq_len).
    • hypotheses (list[list[Tuple[Tensor]]]) – Contains a tuple of score (float), sequence (long), and attention (float or None).

advance(log_probs, attn)

DecodeStrategy subclasses should override advance().

Advance is used to update self.alive_seq, self.is_finished, and, when appropriate, self.alive_attn.

initialize(*args, **kwargs)

DecodeStrategy subclasses should override initialize().

initialize should be called before all actions; it prepares the necessary ingredients for decoding.

update_finished()

DecodeStrategy subclasses should override update_finished().

update_finished is used to update self.predictions, self.scores, and other “output” attributes.

eole.predict.greedy_search.sample_with_temperature(logits, temperature, top_k, top_p)

Select next tokens randomly from the top k possible next tokens.

Samples from a categorical distribution over the top_k words using the category probabilities logits / temperature.

  • Parameters:
    • logits (FloatTensor) – Shaped (batch_size, vocab_size). These can be logits ((-inf, inf)) or log-probs ((-inf, 0]). (The distribution actually uses the log-probabilities logits - logits.logsumexp(-1), which equals the logits if they are log-probabilities summing to 1.)
    • temperature (float) – Used to scale down logits. The higher the value, the more likely it is that a non-max word will be sampled.
    • top_k (int) – This many words could potentially be chosen. The other logits are set to have probability 0.
    • top_p (float) – Keep the most likely words until the cumulative probability exceeds p. If used together with top_k, both conditions are applied.
  • Returns:
    • topk_ids: Shaped (batch_size, 1). These are the sampled word indices in the output vocab.
    • topk_scores: Shaped (batch_size, 1). These are essentially (logits / temperature)[topk_ids].
  • Return type: (LongTensor, FloatTensor)
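
The filtering that precedes sampling can be sketched over a single vocabulary row; the real function operates on batched tensors and then samples from the surviving entries, but the masking logic is the same in spirit. Everything below is an illustrative sketch, not the eole implementation.

```python
import math

def filter_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Scale logits by temperature, then mask everything outside the
    top-k / nucleus (top-p) set to -inf. Operates on one plain-Python
    row; top_k <= 0 disables top-k, top_p = 1.0 disables nucleus."""
    scaled = [x / temperature for x in logits]
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = set(order if top_k <= 0 else order[:top_k])
    if top_p < 1.0:
        # softmax over the scaled logits (numerically stabilized)
        z = max(scaled)
        exps = [math.exp(x - z) for x in scaled]
        total = sum(exps)
        cum, nucleus = 0.0, set()
        for i in order:
            nucleus.add(i)
            cum += exps[i] / total
            if cum > top_p:  # keep words until cumulative prob exceeds p
                break
        keep &= nucleus  # both conditions applied when combined
    return [x if i in keep else -math.inf for i, x in enumerate(scaled)]
```

Sampling then draws from a categorical distribution over the unmasked entries.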

Scoring

class eole.predict.penalties.PenaltyBuilder(cov_pen, length_pen)

Bases: object

Returns the Length and Coverage Penalty function for Beam Search.

  • Parameters:
    • length_pen (str) – option name of length pen
    • cov_pen (str) – option name of cov pen
  • Variables:
    • has_cov_pen (bool) – Whether coverage penalty is None (applying it is a no-op). Note that the converse isn’t true. Setting beta to 0 should force coverage length to be a no-op.
    • has_len_pen (bool) – Whether length penalty is None (applying it is a no-op). Note that the converse isn’t true. Setting alpha to 1 should force length penalty to be a no-op.
    • coverage_penalty (callable[[FloatTensor, float], FloatTensor]) – Calculates the coverage penalty.
    • length_penalty (callable[[int, float], float]) – Calculates the length penalty.

coverage_none(cov, beta=0.0)

Returns zero as penalty

coverage_summary(cov, beta=0.0)

Our summary penalty.

coverage_wu(cov, beta=0.0)

GNMT coverage re-ranking score.

See “Google’s Neural Machine Translation System” (Wu et al., 2016). cov is expected to be sized (*, seq_len), where * is probably batch_size x beam_size but could be several dimensions like (batch_size, beam_size). If cov is attention, then the seq_len axis probably sums to (almost) 1.

length_average(cur_len, alpha=1.0)

Returns the current sequence length.

length_none(cur_len, alpha=0.0)

Returns unmodified scores.

length_wu(cur_len, alpha=0.0)

GNMT length re-ranking score.

See “Google’s Neural Machine Translation System” (Wu et al., 2016).
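
The length penalty in that paper is lp(Y) = ((5 + |Y|) / 6)^alpha, by which the cumulative log-probability of a hypothesis is divided; a minimal sketch, assuming this standard formula:

```python
def length_wu(cur_len, alpha=0.0):
    """GNMT length penalty, lp(Y) = ((5 + |Y|) / 6) ** alpha.
    Beam scores are divided by this value, so alpha > 0 favors
    longer hypotheses; alpha = 0 makes it a no-op (always 1)."""
    return ((5 + cur_len) / 6.0) ** alpha
```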

class eole.predict.GNMTGlobalScorer(alpha, beta, length_penalty, coverage_penalty)

Bases: object

NMT re-ranking.

  • Parameters:
    • alpha (float) – Length parameter.
    • beta (float) – Coverage parameter.
    • length_penalty (str) – Length penalty strategy.
    • coverage_penalty (str) – Coverage penalty strategy.
  • Variables: